QTM 447 Lecture 9: Deep Neural Networks

Kevin McAlister

February 11, 2025

Neural Networks

\[ \newcommand\hbb{{\hat{\boldsymbol \beta}}} \newcommand\bb{{\boldsymbol \beta}} \newcommand\expn{{\frac{1}{N} \sum \limits_{i = 1}^N}} \newcommand\sumk{\sum \limits_{k = 1}^K} \newcommand\argminb{\underset{\bb}{\text{argmin }}} \newcommand\argmaxb{\underset{\bb}{\text{argmax }}} \newcommand\gtheta{\mathbf g(\boldsymbol \theta)} \newcommand\htheta{\mathbf H(\boldsymbol \theta)} \]

A single layer, fully connected neural network (FCNN) can be succinctly defined in two equations:

\[\mathcal L(\boldsymbol \theta | \mathbf X , \mathbf y) = \expn \mathcal L(\theta_i | \mathbf x_i , y_i)\]

\[\theta_i = g(\bb^T \varphi(\mathbf W^T \mathbf x_i + \mathbf c) + b)\]

Neural Networks

\[\mathcal L(\boldsymbol \theta | \mathbf X , \mathbf y) = \expn \mathcal L(\theta_i | \mathbf x_i , y_i)\]

\[\theta_i = g(\bb^T \varphi(\mathbf W^T \mathbf x_i + \mathbf c) + b)\]

  • \(K\) hidden units

  • \(\bb\) is a \(K\)-vector of output layer weights

  • \(b\) is an output layer bias term

  • \(\varphi()\) is an activation function (like ReLU) that induces nonlinearities elementwise

  • \(\mathbf W\) is a \(P \times K\) matrix of weights. Each column of \(\mathbf W\) corresponds to a single hidden unit.

  • \(\mathbf c\) is a \(K\)-vector of bias terms.
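As a concrete sketch, the two equations translate to a few lines of numpy (the shapes, random values, and the sigmoid choice for \(g()\) here are illustrative assumptions, not from the lecture code):

```python
import numpy as np

rng = np.random.default_rng(0)

P, K = 5, 3                  # input features, hidden units (illustrative sizes)
x = rng.normal(size=P)       # one observation

W = rng.normal(size=(P, K))  # hidden layer weights: one column per hidden unit
c = rng.normal(size=K)       # hidden layer biases
beta = rng.normal(size=K)    # output layer weights
b = 0.5                      # output layer bias

relu = lambda a: np.maximum(a, 0.0)           # elementwise activation phi()
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))  # g() for binary classification

h = relu(W.T @ x + c)          # hidden representation, shape (K,)
theta = sigmoid(beta @ h + b)  # theta_i for this observation
```

The `@` products mirror the equation exactly: \(\mathbf W^T \mathbf x_i + \mathbf c\) inside the activation, then \(\bb^T \varphi(\cdot) + b\) through \(g()\).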

Neural Networks

With large enough \(K\), a single layer FCNN is a universal approximator

  • It can approximate essentially any function while estimating only (roughly) \(K \times P\) coefficients

Compared to other methods:

  • KNN is sort of \(\frac{N}{K}\)

  • Polynomial expansions are \(P^d\)

  • Kernel Methods are \(N\)

  • Trees are \(N\)

Note that the difference here is that \(N\) doesn’t play much of a role in determining how complex the NN approximator is!
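Counting the coefficients directly for the MNIST-sized example coming up (784 pixels, 4 hidden units) makes the point concrete:

```python
# Coefficient count for a single layer FCNN with P features and K hidden units
P, K = 784, 4

n_hidden = P * K + K   # W (P x K) plus the K bias terms in c
n_output = K + 1       # beta (K-vector) plus the output bias b
total = n_hidden + n_output

print(total)  # 3145
```

About 3,100 coefficients regardless of whether \(N\) is 1,000 or 1,000,000 — the model's complexity is set by \(K\) and \(P\), not \(N\).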

Neural Networks

When \(N\) is really large, NNs are really state of the art for classification/regression

  • It’s just a bunch of regressions

  • Arranged in a really clever way

And they tend to scale quite nicely to big data!

Neural Networks

Another attractive feature of neural networks is explicit feature extraction

Recall that a FCNN can be thought of as a graph:

Neural Networks

We map the original feature space to the hidden feature space

\[\mathbf h_i = \varphi(\mathbf W^T \mathbf x_i + \mathbf c)\]

  • A mapping of the \(P\) dimensional space to a more useful \(K\) dimensional space

A reasonable question is whether or not the \(K\) dimensional space actually corresponds to anything meaningful

Neural Networks

To examine this, we’re going to look at the 3/8 problem using the MNIST handwritten digits data set.

Neural Networks

14,780 instances of handwritten 3s and 8s

  • 784 pixel values in grayscale that correspond to one of the 28 by 28 pixel locations in the image

  • Small values correspond to background pixels and large values correspond to pencil pixels

Goal:

Build a classifier over the flattened pixels (i.e., ignoring locational context) that separates 3s from 8s

Neural Networks

from sklearn.neural_network import MLPClassifier
import numpy as np
import math
import matplotlib.pyplot as plt

# X (14,780 x 784 pixel matrix) and y (labels) are assumed to be loaded already
y = y.reshape(-1)

# Create a 1-layer neural network with 4 hidden features, a ReLU activation function, and the Adam optimizer
clf = MLPClassifier(hidden_layer_sizes=(4,), activation='relu', solver='adam', random_state=123)

# Train the neural network on X and y
clf.fit(X, y)

# Compute the accuracy of the classifier
accuracy = clf.score(X, y)

# Get the predicted labels for X
y_pred = clf.predict(X)

# Find the indices of the misclassified examples
misclassified = np.where(y != y_pred)

# Get the misclassified examples
X_misclassified = X[misclassified]
y_misclassified = y[misclassified]

# Determine the number of misclassified examples
num_misclassified = X_misclassified.shape[0]

# Calculate the number of rows and columns for the subplots
num_rows = math.ceil(math.sqrt(num_misclassified))
num_cols = num_rows

# Create a new figure
plt.figure()

# For each misclassified example
for i in range(num_misclassified):
    # Reshape the row into a 28x28 array
    image = X_misclassified[i].reshape(28, 28)

    # Add a subplot with the image
    plt.subplot(num_rows, num_cols, i+1)
    plt.imshow(image, cmap='gray')

plt.show()

Neural Networks

Roughly a 0.03% error rate

  • Anyone know about how well regular ol’ logistic regression does in this case?

With only 4 hidden units, it takes sklearn about a second to train

  • 784 features

  • 14k observations!

Super scalable!

Neural Networks

The more interesting part - what do the hidden units look like?

Specifically, what mapping does the NN do in each hidden unit to put pixels into the hidden space?

In this case, all of our feature values are positive

  • Range between 0 and 255

For a single hidden unit:

\[h_{ik} = \varphi(\mathbf W_k^T \mathbf x_i + c_k)\]

Neural Networks

Each weight for the hidden unit corresponds to a pixel

  • Positive weights multiplied by a positive value indicate that it is trying to activate that pixel in the hidden representation

  • Negative weights multiplied by a positive value indicate that it is trying to not activate that pixel in the hidden representation (e.g. push the inner product below zero so the ReLU just returns 0)

  • Weights that are essentially zero imply that the pixel doesn’t have any positive or negative relationship with the hidden feature that it is uncovering
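After fitting, those per-pixel weights can be pulled straight out of the model: `clf.coefs_[0]` holds the \(P \times K\) first-layer weight matrix. A runnable sketch (the random stand-in data here is a hypothetical placeholder for the MNIST 3/8 matrix):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical stand-in data so this sketch runs on its own;
# in lecture, X and y are the MNIST 3/8 pixels and labels
rng = np.random.default_rng(123)
X = rng.uniform(0, 255, size=(100, 784))
y = np.array([0, 1] * 50)

clf = MLPClassifier(hidden_layer_sizes=(4,), activation='relu',
                    solver='adam', max_iter=30, random_state=123)
clf.fit(X, y)

# First-layer weights: one column of 784 pixel weights per hidden unit
W = clf.coefs_[0]
print(W.shape)  # (784, 4)

# Each column reshapes to the 28x28 pixel grid for plotting with plt.imshow()
unit0 = W[:, 0].reshape(28, 28)
```

Plotting each reshaped column as an image is exactly how the hidden-unit pictures on the following slides are made.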

Neural Networks


The hidden units are related to the things that make 3s look like 3s and 8s look like 8s

  • The empty space on the left hand side of a 3

  • The double circle structure of an 8

In a way, each hidden unit is attempting to create a mapping from the feature space to a latent space, where the latent space represents a major source of variation within the original feature space

  • With the end goal of effectively separating 3s and 8s

Neural Networks

Over all instances, let \(\mathbf Z\) be the collection of hidden values for each observation in each hidden feature (an \(N \times K\) matrix)

By definition:

\[\mathbf Z = \varphi \left( \mathbf X \mathbf W + \mathbf 1 \mathbf c^T \right)\]

where \(\mathbf W\) is a \(P \times K\) matrix

This looks quite familiar…

  • \(\mathbf Z\) represents a latent space

  • Any thoughts?
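As a sketch, computing \(\mathbf Z\) for all observations at once is one matrix product (random shapes here are illustrative; the bias term \(\mathbf c\) from the per-observation definition is included for completeness):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, K = 6, 4, 3                # observations, features, hidden units
X = rng.normal(size=(N, P))
W = rng.normal(size=(P, K))
c = rng.normal(size=K)

# Hidden representation for every row of X at once: an N x K matrix
Z = np.maximum(X @ W + c, 0.0)   # broadcasting adds c to each row
print(Z.shape)  # (6, 3)
```

Each row of `Z` is one observation's coordinates in the \(K\)-dimensional latent space.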

Neural Networks

This is a nonlinear mapping of the original feature space to a latent space

  • Sort of like a nonlinear PCA…

Keep this in the back of your mind!

Neural Networks

Each hidden unit is creating a latent variable that relates to some aspect of the pixels that make the digit what it is.

The issue is that it’s trying to do everything at once!

  • Each hidden unit in a single layer model is finding a prototype and (more or less) measuring similarity to that prototype

Neural Networks

After creating the hidden space, the goal is then to try to find a linear separator in the hidden space that separates the classes!

  • Let’s just look at the first three hidden units, since these cover most of the variation in the data set.

  • The ReLU activation function just ensures that the approach appropriately emphasizes features rather than trying to create a direct PCA space

Neural Networks


What makes a 3 a 3? What makes an 8 an 8? What makes a 3 different from an 8?

Neural Networks

In reality, there’s a sort of hierarchical logic that helps us distinguish between digits

  • If it has a curvy top, a curvy bottom, and no pencil marks on the left, it’s a 3

  • If it has curves on the top and bottom and two sets of curves on the left and right, then it’s an 8.

Deep Neural Networks

A two layer, fully connected neural network (FCNN) can be succinctly defined in two equations:

\[\mathcal L(\boldsymbol \theta | \mathbf X , \mathbf y) = \expn \mathcal L(\theta_i | \mathbf x_i , y_i)\]

\[\theta_i = g(\bb^T \varphi(\mathbf W_2^T \varphi(\mathbf W_1^T \mathbf x_i + \mathbf c_1) + \mathbf c_2) + b)\]

  • \(K_1\) hidden units in the first layer

  • \(K_2\) hidden units in the second layer

  • Two sets of hidden weight matrices: \(\mathbf W_1\) (\(P \times K_1\)) and \(\mathbf W_2\) (\(K_1 \times K_2\))
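The two-layer forward pass is just the single-layer one composed with itself (as before, the shapes, random values, and sigmoid \(g()\) are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
P, K1, K2 = 5, 4, 3              # features, first-layer units, second-layer units
x = rng.normal(size=P)

W1, c1 = rng.normal(size=(P, K1)), rng.normal(size=K1)    # layer 1
W2, c2 = rng.normal(size=(K1, K2)), rng.normal(size=K2)   # layer 2
beta, b = rng.normal(size=K2), 0.0                        # output layer

relu = lambda a: np.maximum(a, 0.0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

h1 = relu(W1.T @ x + c1)       # first hidden layer, shape (K1,)
h2 = relu(W2.T @ h1 + c2)      # second hidden layer, shape (K2,)
theta = sigmoid(beta @ h2 + b) # theta_i for this observation
```

Note the dimension bookkeeping: \(\mathbf W_1\) maps \(P \to K_1\), \(\mathbf W_2\) maps \(K_1 \to K_2\), and the output layer maps \(K_2 \to 1\).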

Deep Neural Networks

Why depth rather than width?

DNNs with an equivalent number of weight vectors to single layer NNs:

  • Achieve higher accuracy with fewer hidden units

  • Are computationally faster than an equivalent single layer model

Deep Neural Networks

How are we able to achieve higher accuracy with a DNN?

It really comes down to how it constructs the hidden units!

Let’s go back to the MNIST data.

Deep Neural Networks

from sklearn.neural_network import MLPClassifier
import numpy as np
import math
import matplotlib.pyplot as plt

y = y.reshape(-1)

# Create a 2-layer neural network: 4 units in the first hidden layer, 3 in the second
clf = MLPClassifier(hidden_layer_sizes=(4,3), activation='relu', solver='adam', random_state=123)

# Train the neural network on X and y
clf.fit(X, y)

# Compute the accuracy of the classifier
accuracy = clf.score(X, y)

# Get the predicted labels for X
y_pred = clf.predict(X)

# Find the indices of the misclassified examples
misclassified = np.where(y != y_pred)

# Get the misclassified examples
X_misclassified = X[misclassified]
y_misclassified = y[misclassified]

# Determine the number of misclassified examples
num_misclassified = X_misclassified.shape[0]

# Calculate the number of rows and columns for the subplots
num_rows = math.ceil(math.sqrt(num_misclassified))
num_cols = num_rows

# Create a new figure
plt.figure()

# For each misclassified example
for i in range(num_misclassified):
    # Reshape the row into a 28x28 array
    image = X_misclassified[i].reshape(28, 28)

    # Add a subplot with the image
    plt.subplot(num_rows, num_cols, i+1)
    plt.imshow(image, cmap='gray')

plt.show()

Deep Neural Networks

Note that this does a little bit worse on the training data!

  • A 0.45% error rate

  • Any thoughts as to why?

Ultimately, it will make some nice pictures for understanding!

Deep Neural Networks


The first layer does okay at separating the 3s and 8s

  • But, not as good as the full NN

The goal of the first layer is not to classify!

  • The goal of the first layer is to arrange things in a way that can then be rearranged into optimal classification form.

  • A hierarchical approach!

Deep Neural Networks


With a two layer network:

  • First layer breaks the images up into a set of features (curves, thickness of lines, locations, etc.)

  • Second layer puts the new features into a classification formation

  • Output layer can draw a linear boundary!

Deep Neural Networks

DNNs work so well because they treat the feature extraction problem in a hierarchical way!

  • For classification, this is how a human would decide whether or not something is a 3 or an 8

  • With enough hidden units in the first layer, get very granular features that lead to easy decision making for the linear classifier

  • Not seeing the full benefit yet because this kind of feature structure isn’t that common in tabular data!

Deep Neural Networks

Why stop at two layers?

Deep Neural Networks

In theory, we could let our network be as deep as we want

The general expectation:

  • The first layer picks up a very granular set of features from the original feature space

  • The next layer aggregates those features into a different set

  • The next layer aggregates again

  • On and on and on

Deep Neural Networks

This hierarchical feature extraction approach is the dominant one for images and text!

  • Sorta matches human methods of prediction

There’s also a nifty Bayesian interpretation of deep models over wide models

Deep Neural Networks

When we do machine learning, we’re setting prior beliefs about the kind of function that we should learn

  • A deep model is a very general belief that the overall function is a composition of a collection of simpler linear functions

  • What is a dog if not just a collection of lines arranged a specific way?

  • Each layer decreases the granularity of those simple linear approximations

  • A bunch of simple functions together estimate really complicated ones!

This is the basis of deep learning

  • The logic really works well for images and text

Deep Neural Networks

Are DNNs universal approximators?

NN Universal Approximation theorem:

Suppose we have a ReLU network with \(P\) inputs, \(D\) layers, and \(K\) units per layer.

The number of linear regions carved out by this ReLU network is:

\[\mathcal O \left( \left( \frac{K}{P} \right)^{P(D - 1)} K^P \right)\]

  • With a depth one network, the number of regions increases in \(K^P\)

  • With more layers, we’re amping up the number of regions

  • The cost grows only linearly in depth: roughly \(D \times K\) vectors of coefficients in total
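A rough numeric illustration of depth beating width (the exact counting here follows the Montúfar et al. lower bound that the \(\mathcal O(\cdot)\) expression summarizes; the specific constants are an assumption, not from the lecture):

```python
from math import comb

def region_lower_bound(P, K, D):
    """Lower bound on linear regions for a ReLU net with D layers of K units
    on P inputs (Montufar et al. 2014-style bound; illustrative only)."""
    depth_factor = (K // P) ** (P * (D - 1))          # regions replicated per layer
    final_layer = sum(comb(K, j) for j in range(P + 1))  # depth-one region count
    return depth_factor * final_layer

# Same total number of units (12), arranged deep vs. wide:
deep = region_lower_bound(P=2, K=4, D=3)    # three layers of 4 units
wide = region_lower_bound(P=2, K=12, D=1)   # one layer of 12 units
print(deep, wide)  # 176 79
```

Same budget of units, but the deep arrangement carves out more than twice as many linear regions.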

Deep Neural Networks

DNNs can create really complex decision surfaces with not that many vectors of coefficients!

  • It’s quite efficient, all in.

A first thought: Why not just fit skinny 1000 layer neural networks all the time?

  • Any thoughts?

Deep Neural Networks

Next week, we’re going to show that this isn’t computationally feasible due to the SGD training procedure

  • The gradients just don’t work well in this setting

Training DNNs

In theory, DNNs will always return what we want and interpolate the training data with a reasonable number of hidden units

  • The path from the intercept model to the interpolating model is quite short in terms of units needed

Training NNs isn’t all roses, though.

Training DNNs

Problem 1:

The NN objective function isn’t globally convex!

  • In fact, there are going to be lots of local minima along the way

Training DNNs

This is generically true but hard to prove.

So, let’s show this via a simple example.

Let’s consider a simple problem. 2 observations with one feature.

Training DNNs

\[\mathbf X^T = [1,-1] \text{ ; } \mathbf y^T = [1,-1]\]

Our loss is MSE:

\[\mathcal L(\boldsymbol \theta | \mathbf X , \mathbf y) = \expn (y_i - \hat{y}_i)^2\]

Use a one layer NN with 2 hidden units and a ReLU activation:

\[\hat{y}_i = \bb^T \mathbf z_i + b\]

\[\mathbf z_i = \varphi(\mathbf w x_i + \mathbf c)\]

Training DNNs

\[\mathbf X^T = [1,-1] \text{ ; } \mathbf y^T = [1,-1]\]

Let’s set all of the bias terms to zero. Consider two different sets of coefficients:

\[\mathbf w^T = [-1,1] \text{ ; } \bb^T = [-1,1]\]

\[\mathbf w'^T = [1,-1] \text{ ; } \bb'^T = [1,-1]\]

For each set of coefficients, compute the loss under MSE.

  • Don’t forget to apply the ReLU function elementwise!

Training DNNs

Convexity:

\[\mathcal L(\alpha\{\mathbf w, \bb \} + (1 - \alpha)\{\mathbf w', \bb' \} | \mathbf X , \mathbf y) \le \alpha \mathcal L(\{\mathbf w, \bb \} | \mathbf X , \mathbf y) + (1 - \alpha) \mathcal L(\{\mathbf w', \bb' \} | \mathbf X , \mathbf y)\]

Setting \(\alpha = .5\), we get:

\[\mathbf w^T = [0,0] \text{ ; } \bb^T = [0,0]\]

What is the loss at this point?
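Working the toy example out in numpy answers both questions (all biases are zero, as set up above):

```python
import numpy as np

X = np.array([1.0, -1.0])
y = np.array([1.0, -1.0])
relu = lambda a: np.maximum(a, 0.0)

def mse(w, beta):
    """MSE for the one-layer, 2-hidden-unit net with all biases zero."""
    Z = relu(np.outer(X, w))   # shape (2 observations, 2 hidden units)
    y_hat = Z @ beta
    return np.mean((y - y_hat) ** 2)

# Both original coefficient sets fit the data perfectly:
print(mse(np.array([-1.0, 1.0]), np.array([-1.0, 1.0])))  # 0.0
print(mse(np.array([1.0, -1.0]), np.array([1.0, -1.0])))  # 0.0

# But their midpoint (alpha = 0.5) predicts 0 everywhere:
print(mse(np.array([0.0, 0.0]), np.array([0.0, 0.0])))    # 1.0
```

Two points with loss 0 whose midpoint has loss 1 — the convexity inequality fails, so the NN loss surface is not convex.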

Training DNNs

In general, the NN loss function is not convex in the unknowns!

  • There are a lot of little local minima

  • Not all are created equally!

Training DNNs

We have a flat minimum and a sharp minimum for the empirical loss function

  • The global min is the sharp one, but the flat min isn’t all that different.

We haven’t talked much about generalization for NNs (it’s coming), but this type of loss surface appears frequently for NNs and will have an effect on generalization

Which minimum do you think will generalize better?

  • What is our definition of overfitting?

Training DNNs

SGD is only guaranteed to converge to one of these local minima!

  • Depends a little on starting values

But, SGD is also more likely to converge to the flat one!

  • Any intuition?

  • Think about the noisiness inherent in SGD.

Training DNNs

SGD has also been shown to seek out flat minima that correspond to simpler models

  • An inherent L2 regularization in the SGD approach when compared to Newton-Raphson and other second order methods

A sort of funky result

  • See PML1 13.5.6 for more info

Training DNNs

Problem 2:

The gradient for a DNN can be really hard to work with

  • Lack of differentiability due to ReLU activations

  • Vanishing and exploding gradients due to activation function choice

  • Large hidden layers have very large Jacobians describing the first derivatives with respect to the unknowns

Training DNNs

For a generic DNN, the chain rule is still used, but it can get really complicated really quickly.

Fortunately, there is a famous algorithm that will allow us to train DNNs of arbitrary depth pretty efficiently: backpropagation

Training DNNs

Next class, we’ll discuss backpropagation and then address the problems with gradients for DNNs as much as possible

  • Even with all the fixes we can come up with, we may still be unable to really get close to a good minimum!